Serveur d'exploration sur l'OCR

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Adaptive Detection of Missed Text Areas in OCR Outputs: Application to the Automatic Assessment of OCR quality in Mass Digitization projects

Identifieur interne : 000152 ( Main/Exploration ); précédent : 000151; suivant : 000153

Adaptive Detection of Missed Text Areas in OCR Outputs: Application to the Automatic Assessment of OCR quality in Mass Digitization projects

Auteurs : Ahmed Ben Salah [France] ; Nicolas Ragot [France] ; Thierry Paquet [France]

Source :

RBID : Hal:hal-00820564

Descripteurs français

Abstract

The French National Library (BnF ) has launched many mass digitization projects in order to give access to its collection. The indexation of digital documents on Gallica (digital library of the BnF) is done through their textual content obtained thanks to service providers that use Optical Character Recognition softwares (OCR). OCR softwares have become increasingly complex systems composed of several subsystems dedicated to the analysis and the recognition of the elements in a page. However, the reliability of these systems is always an issue at stake. Indeed, in some cases, we can nd errors in OCR outputs that occur because of an accumulation of several errors at di erent levels in the OCR process. One of the frequent errors in OCR outputs is the missed text components. The presence of such errors may lead to severe defects in digital libraries. In this paper, we investigate the detection of missed text components to control the OCR results from the collections of the French National Library. Our veri cation approach uses local information inside the pages based on Radon transform descriptors and Local Binary Patterns descriptors (LBP) coupled with OCR results to control their consistency. The experimental results show that our method detects 84.15% of the missed textual components, by comparing the OCR ALTO les outputs (produced by the service providers) to the images of the document.

Url:
DOI: 10.1117/12.2003733


Affiliations:


Links toward previous steps (curation, corpus...)


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Adaptive Detection of Missed Text Areas in OCR Outputs: Application to the Automatic Assessment of OCR quality in Mass Digitization projects</title>
<author>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300317" type="direct">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="direct">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="direct">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Bourgogne</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
</affiliation>
</author>
<author>
<name sortKey="Ragot, Nicolas" sort="Ragot, Nicolas" uniqKey="Ragot N" first="Nicolas" last="Ragot">Nicolas Ragot</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation active="#struct-300408" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300408" type="direct">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="direct">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Paquet, Thierry" sort="Paquet, Thierry" uniqKey="Paquet T" first="Thierry" last="Paquet">Thierry Paquet</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300317" type="direct">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="direct">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="direct">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Bourgogne</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00820564</idno>
<idno type="halId">hal-00820564</idno>
<idno type="halUri">https://hal-bnf.archives-ouvertes.fr/hal-00820564</idno>
<idno type="url">https://hal-bnf.archives-ouvertes.fr/hal-00820564</idno>
<idno type="doi">10.1117/12.2003733</idno>
<date when="2013-02-05">2013-02-05</date>
<idno type="wicri:Area/Hal/Corpus">000018</idno>
<idno type="wicri:Area/Hal/Curation">000018</idno>
<idno type="wicri:Area/Hal/Checkpoint">000049</idno>
<idno type="wicri:Area/Main/Merge">000154</idno>
<idno type="wicri:Area/Main/Curation">000152</idno>
<idno type="wicri:Area/Main/Exploration">000152</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Adaptive Detection of Missed Text Areas in OCR Outputs: Application to the Automatic Assessment of OCR quality in Mass Digitization projects</title>
<author>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300317" type="direct">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="direct">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="direct">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Bourgogne</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
</affiliation>
</author>
<author>
<name sortKey="Ragot, Nicolas" sort="Ragot, Nicolas" uniqKey="Ragot N" first="Nicolas" last="Ragot">Nicolas Ragot</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation active="#struct-300408" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300408" type="direct">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="direct">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Paquet, Thierry" sort="Paquet, Thierry" uniqKey="Paquet T" first="Thierry" last="Paquet">Thierry Paquet</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300317" type="direct">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="direct">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="direct">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Bourgogne</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
</affiliation>
</author>
</analytic>
<idno type="DOI">10.1117/12.2003733</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="fr">
<term>OCR results assessment</term>
<term>Text detection</term>
<term>Texture characterization</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The French National Library (BnF ) has launched many mass digitization projects in order to give access to its collection. The indexation of digital documents on Gallica (digital library of the BnF) is done through their textual content obtained thanks to service providers that use Optical Character Recognition softwares (OCR). OCR softwares have become increasingly complex systems composed of several subsystems dedicated to the analysis and the recognition of the elements in a page. However, the reliability of these systems is always an issue at stake. Indeed, in some cases, we can nd errors in OCR outputs that occur because of an accumulation of several errors at di erent levels in the OCR process. One of the frequent errors in OCR outputs is the missed text components. The presence of such errors may lead to severe defects in digital libraries. In this paper, we investigate the detection of missed text components to control the OCR results from the collections of the French National Library. Our veri cation approach uses local information inside the pages based on Radon transform descriptors and Local Binary Patterns descriptors (LBP) coupled with OCR results to control their consistency. The experimental results show that our method detects 84.15% of the missed textual components, by comparing the OCR ALTO les outputs (produced by the service providers) to the images of the document.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Centre-Val de Loire</li>
<li>Région Bourgogne</li>
<li>Région Centre</li>
</region>
<settlement>
<li>Rouen</li>
<li>Tours</li>
</settlement>
<orgName>
<li>Centre Val de Loire Université</li>
<li>Université François-Rabelais de Tours</li>
<li>Université de Rouen</li>
</orgName>
</list>
<tree>
<country name="France">
<region name="Région Bourgogne">
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
</region>
<name sortKey="Paquet, Thierry" sort="Paquet, Thierry" uniqKey="Paquet T" first="Thierry" last="Paquet">Thierry Paquet</name>
<name sortKey="Ragot, Nicolas" sort="Ragot, Nicolas" uniqKey="Ragot N" first="Nicolas" last="Ragot">Nicolas Ragot</name>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000152 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000152 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-00820564
   |texte=   Adaptive Detection of Missed Text Areas in OCR Outputs: Application to the Automatic Assessment of OCR quality in Mass Digitization projects
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024